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Abstract 

Background: The wide use of high-throughput DNA microarray technology provide an increasingly detailed view 
of human transcriptome from hundreds to thousands of genes. Although biomedical researchers typically design 
microarray experiments to explore specific biological contexts, the relationships between genes are hard to 
identified because they are complex and noisy high-dimensional data and are often hindered by low statistical 
power. The main challenge now is to extract valuable biological information from the colossal amount of data to 
gain insight into biological processes and the mechanisms of human disease. To overcome the challenge requires 
mathematical and computational methods that are versatile enough to capture the underlying biological features 
and simple enough to be applied efficiently to large datasets. 

Methods: Unsupervised machine learning approaches provide new and efficient analysis of gene expression 
profiles. In our study, two unsupervised knowledge-based matrix factorization methods, independent component 
analysis (ICA) and nonnegative matrix factorization (NMF) are integrated to identify significant genes and related 
pathways in microarray gene expression dataset of Alzheimer's disease. The advantage of these two approaches is 
they can be performed as a biclustering method by which genes and conditions can be clustered simultaneously. 
Furthermore, they can group genes into different categories for identifying related diagnostic pathways and 
regulatory networks. The difference between these two method lies in ICA assume statistical independence of the 
expression modes, while NMF need positivity constrains to generate localized gene expression profiles. 

Results: In our work, we performed FastICA and non-smooth NMF methods on DNA microarray gene expression 
data of Alzheimer's disease respectively. The simulation results shows that both of the methods can clearly classify 
severe AD samples from control samples, and the biological analysis of the identified significant genes and their 
related pathways demonstrated that these genes play a prominent role in AD and relate the activation patterns to 
AD phenotypes. It is validated that the combination of these two methods is efficient. 

Conclusions: Unsupervised matrix factorization methods provide efficient tools to analyze high-throughput 
microarray dataset. According to the facts that different unsupervised approaches explore correlations in the high- 
dimensional data space and identify relevant subspace base on different hypotheses, integrating these methods to 
explore the underlying biological information from microarray dataset is an efficient approach. By combining the 
significant genes identified by both ICA and NMF, the biological analysis shows great efficient for elucidating the 
molecular taxonomy of Alzheimer's disease and enable better experimental design to further identify potential 
pathways and therapeutic targets of AD. 
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Background 

With the widespread use of the DNA microarray tech- 
nology in the study of biology and medicine, large 
amount of gene expression data have been easily accu- 
mulated. The main challenge now is to extract valuable 
biological information from the colossal amount of data 
to gain insight into biological processes and the 
mechanisms of human disease. Modern signal proces- 
sing and machine learning techniques provide new and 
efficient analysis tools to the high-throughput microar- 
ray gene expression dataset. 

In the Gene expression context, similar genes exhibit 
similar behaviours only under a subset of conditions, not 
all conditions; and genes may participate in more than 
one function, resulting in one regulation pattern in one 
context and a different pattern in another. Researchers 
have developed various methods for clustering and identi- 
fying groups of genes or experimental conditions that 
exhibit similar expression patterns, such as k- means [1], 
self-organizing maps (SOM) [2,3] and hierarchical clus- 
tering (HC) [4]. However, these clustering algorithms suf- 
fer from two limitations, the one is they group genes (or 
conditions) based on global similarities in their expres- 
sion profiles; the other is they only assign each gene to a 
single cluster. This is difficult to be interpreted by the 
biologists due to the large number of genes and complex 
underlying inter-gene dependency. 

Recently, many biclustering methods have been pro- 
posed to avoid the drawbacks of the standard clustering 
algorithms. They can simultaneously cluster genes and 
conditions to obtain sets of genes that are co-regulated 
under subsets of conditions [5,6]. As particularly pro- 
mising knowledge-based matrix factorization techniques, 
independent component analysis (ICA) and nonnegative 
matrix factorization (NMF) have been successfully 
applied to the biomedical applications to uncover biolo- 
gically meaningful patterns from the data. 

As an unsupervised analysis method, ICA uses the 
existence of independent facts in multivariate data and 
decomposes an input data set into statistically indepen- 
dent component. ICA-based gene classification method 
was first proposed by Hori,G. et al (2001) to classify 
yeast gene expressions during sporulation into biologi- 
cally meaningful groups [7]. Liebermeister W. (2002) 
also presented ICA method for microarray analysis to 
extract expression modes, where each mode represents 
a linear influence of a hidden cellular variable [8]. After 
that, ICA has been widely used in microarray data for 
feature extraction, clustering and the classification on 
yeast cells' cycle [9] and cancer data such as: ovarian 
cancer [10], breast cancer [11], endometrial cancer [12], 
colon and prostate cancer [13] and acute myeloid leuke- 
mia [14], etc. 



For ICA decomposition, the gene expression data pro- 
vided by microarray technology is considered a linear 
combination of basis vectors and some independent 
expression modes with both positive and negative 
expression levels. By grouping analysis on each expres- 
sion mode, significant genes correspond to different sin- 
gle pathways can be identified. 

Compare with ICA, NMF decomposes the input data 
as a product of two matrices that are constrained by 
having nonnegative elements. It is firstly proposed by 
Lee and Seung [15] to decompose human face images 
and they achieved meaningful part-based representation 
due to only additive, not subtractive combinations are 
allowed. Recently, NMF has been successfully used in 
feature extraction and dimensionality reduction of gene 
expression data. This method can simultaneously cluster 
genes and conditions to obtain sets of genes that are co- 
regulated under subsets of conditions. 

However, there is no explicit guarantee to support the 
data has a unique representation in terms of positive 
factors. In the past several years, many methods have 
been developed to achieve further sparseness to the 
basis factors or encoding vectors or both of them. The 
sparseness of both basis factors or encoding vectors are 
especially welcomed in biological data analysis because 
both mutual independent biological process extract 
from gene set and efficient classification from sample 
set are expected. Several improvements to the standard 
NMF such as, local NMF (LNMF) [16], nonnegative 
sparse coding (NNSC) [17], sparse NMF (SNMF) [18], 
NMF with sparseness constraints (NMFSC) [19] and 
non-smooth NMF [20]. 

In this study, we explore the potential of both ICA 
and improved NMF on DNA gene expression data to 
identify significant genes of Alzheimer's disease (AD). 
By using these methodology to decompose AD samples 
into expression modes (ICA) and metagenes (NMF), 
and combining the gene selection results from ICA and 
NMF, the identified genes and the underlying biological 
processes show many clear related pathways in AD and 
the activation patterns to AD phenotypes. This hence 
can help biologists to understand the phenotype-path- 
way relationship and help to find the biomarker genes 
for AD. 

Methods 

The microarray gene expression data are traditionally 
represented as a nxm matrix X with n genes under m 
samples or conditions. The columns of the dataset 
denote gene expression signatures (GES) of n genes dur- 
ing m experiments or environmental conditions while 
each row of matrix X represent gene expression profiles 
(GEP) of each gene across all m conditions. 
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ICA analysis on AD gene expression data 

Let the nxm matrix X denote the microarray gene 
expression data with n genes under m samples or condi- 
tions. %ij in X is the expression level of the i-Xh gene in 
the y-th sample. Generally speaking, the number of 
genes n is much larger than that of the samples m, n 
»m. Therefore, the transform, X T , was often used in 
ICA model. Suppose that the data have been prepro- 
cessed and normalized, i.e. each sample has zero mean 
and standard deviation, then the ICA model for gene 
expression data can be expressed as: 
X T = AS (1) 

where, the columns of A = [a lt a 2 $a n ] are the mxm 
basis vectors (latent vectors) of the gene microarray 
data, the rows of S denotes m gene signatures or expres- 
sion modes, and every row of S is statistically indepen- 
dent to each other. To obtain S and A, the demixing 
model can be expressed as 

Y = WX (2) 

Where Wis an mxm demixing matrix. Figure 1 shows 
a example of ICA on AD gene expression data, detail 
representations will be given in the next section. 

The gene expression data provided by microarray 
technology is considered a linear combination of inde- 
pendent expression modes having specific biological 



interpretations. The m-th row matrix A contained the 
weights with which the expression levels of the n genes 
contribute to the m-th observed expression profile. 
Hence the assignment for the observed expression pro- 
files with different classes is valid for the rows of A, and 
each column of A can be associated with one specific 
expression mode. Since the n-th column of A contains 
the weights with which s n contributes to all observa- 
tions, this column should show large or small entries 
according to the class labels. After such characteristi- 
cally latent variables have been obtained, the corre- 
sponding elementary modes can be identified to yield 
useful information for classification. Also, the distribu- 
tion of gene expression levels generally features a small 
number of significantly overexpressed or underexpressed 
genes that form very biologically coherent groups and 
may be interpreted in terms of regulatory pathways. 

NMF analysis on AD gene expression data 

NMF assumes the given gene expression data is com- 
bined of positive metagenes with meaningful local biolo- 
gical representation instead of statistically independent 
expression modes in ICA. This method can be applied 
to reduce the dimensionality of data and represent the 
original data as a linear combination of a reduced set of 




Figure 1 ICA decomposes AD microarray gene expression data X (13x6398) is AD microarray gene expression data matrix with 13 samples 
and 6398 genes, A is 13x13 basis vector and S is expression mode(1 3x6398). Here columns of A and rows of S are separated for a clear visibility. 



Kong et al. BMC Bioinformatics 201 1, 12(Suppl 5):S7 
http://www.biomedcentral.eom/1 471 -21 05/1 2/S5/S7 



Page 4 of 9 



/r-factors. The nonnegtive matrix factorization can be 
described as follow: 

V « WH (3) 

Where V is a positive matrix (the microarray gene 
expression data) of size nxm with a desired reduced 
rank k (k < n), W are nxk basis factors (also known as 
basis experiments [13] or metagenes [14]) with nonneg- 
tive elements, and H contains the nonnegative coeffi- 
cients (encoding vectors) of the linear combinations of 
the basis factors with size of kxm. 

However, there is no explicit guarantee that the repre- 
sentation of positive factors is unique. From many 
improved work, non-smooth NMF (nsNMF) is applied 
in our work for the AD gene expression data since it is 
reported to be able to enforce sparseness to both basis 
factors and encoding vectors by adding a modification 
variable into the classical NMF model [20]. nsNMF is 
defined as: 

V « WSH (4) 

where V, W, and H are the same as in the classical 
NMF model. S is a kxk positive symmetric matrix and 
to be defined as a "smoothing" matrix: 

S = (1-0)I + -11 T (5) 

where I is the identity matrix, 1 is a column vector of 
ones, 11 T is a kxk matrix with all items are ones, and 
the parameter 0 satisfies 0 < 6 <1. Given a positive non- 
zero vector X, multiplying the smoothing matrix S to 
vector X, SX, if 6 - 0, then the result is no smoothing 
on X. If 6 — > 1, the result of SX tends to the constant 
vector with all elements almost equal to the average of 
the elements of X. This is the smoothest possible vector 
because all entries are equal to the same nonzero value, 
instead of sparseness which having some large values 
and others close to zero. 

According to the equivalently written form of Eq. (4): 
V *{WS)H =W{SH), for a given 0 < 9 <1, the iteration 
of the algorithm can be done by substituting W by WS 
when updating for H, and substituting H by SH when 
updating for W. It can simultaneously sparse W and H 
when convergence. 

Given a NMF decomposition result, we can use matrix 
H to classify m samples into to k classes. Each sample is 
clustered into a group corresponding to the most highly 
expressed metagenes in the sample. The nature merit of 
nsNMF is that the sparseness reinforces the genes and 
experiments that significantly sustain and represent the 
factors. Each original sample is represented by columns 
of basis factors with the corresponding encoding vector 
(the column of H). It is clear that large value of both 
basis factors and encoding vectors play important role 
in representing of the original data. Figure 2 gives a 



NMF factorization example for the same input AD data 
in ICA experiment. 

Results 

To evaluate ICA and nsNMF applied to AD DNA gene 
expression data, we used the data set of hippocampal 
gene expression of control and AD samples from GEO 
DataSets offered by Eric M. Blalock [21]. The hippocam- 
pal specimens they used were obtained through the 
Brain Bank of the Alzheimer's Disease Research Center 
at the University of Kentucky. The human GeneChips 
(HG-U133A) of Affymetrix and Microarray Suite 5 were 
used in analyzing the microarray data. The procedures 
for total RNA isolation, labeling and microarray were 
described in [21] and [22]. We excluded the samples 
with significant noise and chose 8 control and 5 severe 
AD samples with 6398 genes to test FastICA(http:// 
www.cis.hut.fi/projects/ica/fastica/) [23] and nsNMF 
method. 

FastICA decomposes AD gene expression data matrix 
X into latent variable matrix A (13x13) and gene signa- 
ture matrix S (13x6398). In FastICA algorithm, nonlinear 
function g(u)=tanh(al*u) was used as the probability den- 
sity distribution of the outputs during the iteration, where 
here al is a constant. As the FastICA algorithm relies on 
random initializations for its maximization and faces the 
problem of convergence to local optima, we iterated Fas- 
tICA 100 times to alleviate the instability of the slightly 
different results in each iteration. 

In FastICA result, each row of matrix A contained the 
weights with which the expression levels of the n genes 
contribute to the corresponding observed expression 
profile (row of matrix X). Therefore, the profile order of 
rows of A is the same as that of the observed expression 
profiles, and each column of A is associated with the 
corresponding gene signature (IC). The original data set 
consisted of 5 severe AD microarray gene expressions 
(first 5 rows) and 8 control samples (last 8 rows), so this 
assignment is same as the rows of A. Figure 3 shows the 
Hinton diagram representation of matrix A. It is clear 
that the 7-th and 9-th columns of A can discriminate 
between severe AD and the control samples since the 
sign and the values are distinctly different (see Figure 4 
and 5): the first 5 values (for severe AD samples) of the 
7-th column of A are negative and most of the last 8 
values (for control samples) are positive, and so does 
the 9-th column of A. It can be drawn that the corre- 
sponding gene signatures (the 7-th and 9-th row of 
matrix S) give strong contribution to the component. By 
giving a threshold=2.5, the positive and negative load- 
ings correspond to up- and down-regulation of expres- 
sion can be thought as significant genes for diagnosing 
AD. We extract more than 200 significant genes to 
further identification and biological analysis. 
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V « W H 

m samples m samples 




metaexperiments 



metagenes 

Figure 2 The nsNMF decomposition of AD microarray gene expression data.!/ (13x6398) is AD microarray gene expression data matrix 
with 13 samples and 6398 genes, W (6398x2) is metagene matrix and H (2x6398) is the encoding vectors, here columns of W and rows of H 
are separated for a clear visibility. 
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Figure 3 Hinton diagram representation of latent variable matrix A. The size of each square corresponds to the amount o nm of component 
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In nsNMF experiment, we ran 40 times for /<=2, and 
2000 iterations were used when convergence. Each 
metagene achieved by nsNMF is sparser and hence con- 
tains a relatively small set of genes with non-zero coeffi- 
cients that determine different local gene expression 
features. At the same time, the encoding vectors (rows 
of H) are used to determine the classification of the 
samples which are highly associated to these local biolo- 
gical processes. By sorting the original data set by the 
descending order of each metagene and the correspond- 
ing encoding vector, we can obtain the local gene 
expression modules. Figure 6 is the sorting results by 
first and second basis factor and metaexperiment 



respectively. The reconstructed results show that the 
local structures obtained are accumulated at the upper 
and bottom parts of the two figures which correspond- 
ing to the up-regulated and down-regulated in the 
severe AD. There are more than 1500 significant genes 
are identified by nsNMF which related to the pathways 
of AD. 

Discussion 

The biologically analysis of the identified significant 
genes by ICA and NMF are descried as below: 

From ICA factorization results, we identify more than 
100 significant genes and their related pathways play a 
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Figure 5 The FastICA result of 9 th column of A. The 9 th column of A which shows a strong discrimination of control and severe AD samples 
(upper figure) and the corresponding 9 th row of the expression mode (lower figure). 
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Figure 6 Raw input data reconstructed by sorted nsNMF factors, (a) original dataset sorted by the first metagene (the first column of W) 
and the first metaexperiment (the first row of H), (b) original dataset sorted by the second metagene (the second column of W) and the 
second metaexperiment (the second row of H). The genes marked A, B, C, D are selected gene set which significantly express in severe AD. 



prominent role in AD and relate the activation patterns 
to AD phenotypes. More than 50 upregulated significant 
genes in severe AD were extracted in immunoreactions, 
metal protein, membrane protein, lipoprotein, neuro- 
peptide, cytoskeleton protein, binding protein and ribo- 
somal protein. Especially, in immunoreactions, genes 
such as: AMIG02, BTG1, CD24, CD44, CDC42EP4, 
IFITM1, IFITM2, IRF7, IFI44L, IL4R, IRAKI and 
NFKBIA were found upregulated in severe AD; and 
CD22 / MAG were downregulated. Importantly, we 
found the novel associations liking inflammation and 
the expression of APP by the molecular biological 
experiments. Moreover, in metal protein, ICA found 
many high expressed genes such as: CAMK2B, CALM1, 
CAPZA2, CHGB, LOC728320/LTF, MPPE1, MT1F, 
MT1M, MBP, SCGN, SLC24A3, SLC7A11, ZIC1, 
ZBTB20, ZNF500, ZNF580, ZNF652 and ZNF710. ICA 
also found more than 50 downregulated significant 
genes in above categories and some oncogenes and 



phosphoricproteins were low expressed in severe AD 
such as: CABP1, CACNG3, CAMK2B, CAMK1G, 
CAPZB, MET, ZNF365 and TFRC. 

By studying the NMF decomposition results, one of 
the interesting things we found is there are not many 
genes they can be identified both by ICA and NMF. 

By NMF, lots of metal metabolism and inflammation 
related genes were identified, such as, more than 230 up- 
regulated genes are found to be conjunction with Zinc. 
We also identified 321 down- regulated genes in the first 
metagene (the first column of W) and 263 down-regu- 
lated genes in second metagene (the second column of 
W) which are conjunction of metabolism. Compared 
with the control samples, there are 77 up-regulated 
genes, 76 down-regulated genes in first metagene and 
87 up-regulated genes, 67 down-regulated genes in sec- 
ond metagene are conjunction of calcium (Ca (2+)). In 
Alzheimer's disease, calcium permeability through cellular 
membranes appears to underlie neuronal cell death. 
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Furthermore, a big cluster of genes related to inflam- 
mation are founded which are presented up-regulated in 
severe AD samples. 

Finally, both in the factor 1 and 2 show that the genes 
inhibiting cell grows and the genes promoting apoptosis 
are up-regulated. Such as: Programmed cell death 6, 
deleted in liver cancer 1, cyclin N-terminal domain con- 
taining 2, killer cell immunoglobulin-like receptor, p21, 
p53. Whereas, those genes which speed up the cell 
cycle, promote cellular fission and cell repair are shown 
to be down-regulated. 

Conclusions 

In this study microarray gene expression data have been 
analyzed by applying unsupervised knowledge-based 
matrix factorization techniques, ICA and NMF. Either 
the expression modes detected by ICA or the metagenes 
extracted by NMF, are able to assign gene into different 
informative biological processes which made the biclus- 
tering of genes and conditions simultaneously becomes 
possible. 

The identified significant genes by these two methods 
for AD microarray gene expression data shows that few 
genes are the same. That means the genes with high 
expression level in the significant expression modes in 
ICA may not play an important role in metagenes in 
case of NMF. The reason might be they have different 
convergence constraints. ICA assume statistical indepen- 
dence of the expression modes, while NMF need posi- 
tivity and sparseness constrains of metagenes and 
metaexperiments to generate localized gene expression 
profiles. According to their merits, we can expect to 
seek more significant genes for diagnosing an disease by 
applying ICA and NMF on microarray data respectively 
and integrating their results together. 

This approach is demonstrated to be valid in our 
experiments. The significant up-regulation or down-reg- 
ulation genes presented in the expression modes (ICA) 
and local molecular patterns (NMF) shows that they can 
classify samples efficiently and match well the clinical 
symptoms of Alzheimer disease. This tool can uncover 
meaningful biological interpretation to help biologists to 
understand the phenotype-pathway relationship and 
thus help elucidate the molecular taxonomy of AD. 
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